Abstract

Motivation: Knowing the location of a protein within the cell is important for understanding its function, role in
biological processes, and potential use as a drug target. Much progress has been made in developing computational 
methods that predict single locations for proteins. Most such methods are based on the over-simplifying assumption 
that proteins localize to a single location. However, it has been shown that proteins localize to multiple locations. 
While a few recent systems attempt to predict multiple locations of proteins, their performance leaves much room for 
improvement. Moreover, they typically treat locations as independent and do not attempt to utilize possible 
inter-dependencies among locations. Our hypothesis is that directly incorporating inter-dependencies among 
locations into both the classifier-learning and the prediction process can improve location prediction performance. 

Results: We present a new method and a preliminary system we have developed that directly incorporates 
inter-dependencies among locations into the location-prediction process of multiply-localized proteins. Our method 
is based on a collection of Bayesian network classifiers, where each classifier is used to predict a single location. 
Learning the structure of each Bayesian network classifier takes into account inter-dependencies among locations, 
and the prediction process uses estimates involving multiple locations. We evaluate our system on a dataset of 
single- and multi-localized proteins (the most comprehensive protein multi-localization dataset currently available, 
derived from the DBMLoc dataset). Our results, obtained by incorporating inter-dependencies, are significantly higher 
than those obtained by classifiers that do not use inter-dependencies. The performance of our system on 
multi-localized proteins is comparable to a top performing system (YLoc+), without being restricted only to 
location-combinations present in the training set. 



Background 

Knowing the location of a protein within the cell is essen- 
tial for understanding its function, its role in biological 
processes, as well as its potential role as a drug tar- 
get [1]. Experimental methods for protein localization 
such as those based on mass spectrometry [2] or green 
fluorescence detection [3], although often used in prac- 
tice, are time consuming and typically not cost-effective 
for high-throughput localization. Hence, much ongo- 
ing effort has been put into developing high-throughput 
computational methods [4-8] to obtain proteome-wide
location predictions. 

Over the last decade, there has been significant progress 
in the development of computational methods that pre- 
dict a single location per protein. The focus on single- 
location prediction is driven both by the data available 
in public databases such as UniProt [9], where proteins 
are typically assigned a single location, as well as by an 
(over-)simplifying assumption that proteins indeed local- 
ize to a single location. However, proteins do localize 
to multiple compartments within the cell [10-13], and 
translocate from one location to another [14]. Identifying
the multiple locations of a protein is important
because translocation can serve some unique functions. 
For instance, GLUT4, an insulin-regulated glucose trans- 
porter, which is stored in the intracellular vesicles of 
adipocytes, translocates to the plasma membrane in
response to insulin [15,16]. As proteins do not localize 
at random and translocations happen between designated 
inter-dependent locations, we hypothesize that modeling 
such inter-dependencies can help in predicting protein 
locations. Thus, we aim to identify associations or inter- 
dependencies among locations and leverage them in the 
process of predicting locations for proteins. 

Several methods have been recently suggested for predicting
multiple locations for proteins. For instance, King
and Guda introduced ngLOC [17], which uses a naive 
Bayes classifier (see e.g. [18]) to obtain a probability dis- 
tribution over locations for a query protein, where each 
location probability is computed independently. Each protein
is represented as an n-gram constructed based on
its amino acid sequence. For a query protein, estimates 
of the conditional probabilities of the protein to be local- 
ized to each location, given its amino acid sequence, are 
determined. Using the estimates of the two most probable 
locations, a multi-localized confidence score is computed 
as a measure of the likelihood of the protein to be localized 
to both locations; if the score is above a certain threshold, 
the protein is predicted to be assigned to both locations. 
This method is limited to proteins that are localized to at 
most two locations. 

Li et al. [19] construct multiple binary classifiers, where 
each binary classifier distinguishes between a pair of loca- 
tions (one vs. one). Each binary classifier consists of an 
ensemble of k-nearest neighbors (k-NN) (see e.g. [20])
and Support Vector Machines (SVMs) (see e.g. [20,21]). 
The protein representation used in the binary classifiers is 
based on sequence-derived features (e.g. amino acid com- 
position) and gene ontology (GO) terms. The predictions 
from all the classifiers are combined to obtain a score for 
each location. A query protein is assigned to the location 
with the highest score. If multiple locations have the same 
highest score, a multi-location prediction is made and all 
the locations sharing the highest score are predicted for 
the protein. 

Several methods use variations of k-NN to predict multiple
locations for proteins. WoLF PSORT [22,23] uses
k-NN with a distance measure that combines Euclidean
and Manhattan distances, Euk-mPLoc [24] uses an ensemble
of k-NN, and iLoc-Euk [25] uses a multi-label k-NN
classifier. Both WoLF PSORT and Euk-mPLoc represent 
proteins based on sequence-derived features, while Euk- 
mPLoc also uses relevant GO terms. Proteins in iLoc-Euk 
are represented either using relevant GO terms or using 
features that aim to capture the likely substitutions along 
the proteins' amino acid sequences over time. Given a 
query protein, WoLF PSORT assigns it to the location- 
combination that is most common among the protein's k 
nearest neighbors, thus limiting the method to predicting
location-combinations present in the training set. The
two systems iLoc-Euk and Euk-mPLoc both compute a
score for each location, based on the query protein. iLoc- 
Euk assigns the protein to the locations having the highest 
scores; the number of locations assigned is the same as 
that associated with the nearest neighbor protein in the 
dataset. Euk-mPLoc assigns the query protein to loca- 
tions whose score lies within a certain deviation from the 
highest score. iLoc-Euk was not extensively tested against 
existing multi-location predictors. Moreover, to achieve 
the reported level of performance, iLoc-Euk strongly relies 
on features that are only available for proteins that are 
already annotated. The performance of Euk-mPLoc was 
evaluated using an extensive dataset [26] and is the lowest 
among current multi-location predictors. Methods sim- 
ilar to iLoc-Euk were proposed for localizing subsets of 
eukaryotic proteins [27,28], virus proteins [29], and bac- 
terial proteins [30,31]. Several domain-specific systems 
using the same ideas have been introduced by the same 
group (Euk-mPLoc 2.0 [32], Hum-mPLoc 2.0 [33], Plant- 
mPLoc [34], and Virus-mPLoc [35]). 

In contrast to the approaches listed above that 
use feature-based similarity, KnowPredsite [36] uses 
sequence-based similarity to construct a collection of 
location-annotated peptide fragments and predict mul- 
tiple locations for proteins. The collection is built by 
extracting for each protein in the training dataset peptide 
fragments from its sequence and from sequences simi- 
lar to its sequence; each fragment is annotated with the 
protein's locations. The peptide fragments for a query protein
are obtained in a similar manner, and the system uses
the location annotations of matching peptide fragments in 
the collection to compute a score for each location. Using 
the two highest location scores, a multi-localized confi- 
dence score is computed to determine if the protein is 
multi-localized. This method is restricted to predictions 
of at most two locations for a protein (similar to that seen 
earlier for ngLOC [17]). 

Notably, none of the above methods for predicting mul- 
tiple locations utilizes inter-dependencies among loca- 
tions in the prediction process. All the above models 
independently predict each single location and thus do not 
take into account predictions for other locations. 

Recent work by He et al. [37] attempts to take advantage 
of correlation among locations when predicting multiple 
locations of proteins. As part of their classifier training 
process, an imbalanced multi-modal multi-label learning 
(which they denote IMMML) classifier attempts to learn 
a correlation measure between pairs of locations that is 
later used to make the predictions. The protein repre- 
sentation used in IMMML is based on sequence-derived 
features (amino acid composition and pseudo-amino acid 
composition) and gene ontology (GO) terms. While this 
system takes into account a simple type of dependency 
among locations, namely pair-wise correlation between 
locations, it does not account for any more complex inter-dependencies.
Furthermore, this system was not tested on
any extensive protein multi-localization dataset. 

YLoc+ [26], a comprehensive system for protein location
prediction, uses a naive Bayes classifier (see e.g. [18])
and captures protein localization to multiple locations by 
explicitly introducing a new class for each combination of 
locations supported by the training set (i.e. having proteins 
localized to the combination). Thus, each prediction per- 
formed by the naive Bayes classifier can assign a protein 
to only those combinations of locations included in the 
training data. To produce its output, YLoc+ transforms
the prediction into a multinomial distribution over the 
individual locations. We also note that as the number of 
possible location-combinations is exponential in the num- 
ber of locations, training the naive Bayes classifier in this 
manner does not provide a practical model in the gen- 
eral case of multi-localized proteins, beyond the training 
set. The performance of YLoc+ was evaluated using an
extensive dataset [26] and is the highest among current 
multi-location predictors. 

In this paper, we present a new method that directly 
models inter-dependencies among locations and incorpo- 
rates them into the process of predicting locations for 
proteins. Our system is based on a collection of Bayesian 
network classifiers (see e.g. [38]). Each Bayesian Network 
(BN) related to each classifier corresponds to a single 
location L. Each such network is used to assign a condi- 
tional probability for a protein to be found at location L, 
given both the protein's features and information regard- 
ing the protein's other possible locations. Learning each BN 
involves learning the dependencies among the other loca- 
tions that are primarily related to proteins localizing to 
location L. For each Bayesian network classifier, its corresponding
BN is learnt with the goal of improving the classifier's
prediction quality. The formulation of multi-location
prediction as classification via Bayesian networks, as well 
as the network model are presented in the next section. 
Notably, our system does not assume that all proteins it 
classifies are multi-localized, but rather more realistically, 
that proteins may be assigned to one or more locations. 

We train and test our preliminary system on a dataset 
containing single- and multi-localized proteins previously 
used in the development and testing of the YLoc+ system
[26], which includes the most comprehensive collection
of multi-localized proteins currently available,
derived from the DBMLoc dataset [11]. As done in other 
studies [7,8,26,39], we use multiple runs of 5-fold cross- 
validation. The results clearly demonstrate the advantage 
of using location inter-dependencies. The F1 score of
81% and overall accuracy of 76% obtained by incorporat- 
ing inter-dependencies are significantly higher than the 
corresponding values obtained by classifiers that do not 
use inter-dependencies. Also, while our system retains
a level of performance comparable to that of YLoc+ on
the same dataset, we note that unlike YLoc+, our system trains
the individual classifiers to predict individual (although
inter-dependent) locations. Its training is therefore
not restricted to only those combinations of locations
present in the dataset, and the system can generalize
to multi-locations beyond those included in the training
set.

The rest of the paper proceeds as follows: The next 
section formulates the problem of protein subcellu- 
lar multi-location prediction and briefly provides back- 
ground on Bayesian networks and relevant notations. The 
Methods section discusses the structure, parameters, and 
inter-dependencies comprising our Bayesian network col- 
lection, and introduces the learning procedure used for 
finding them. Experiments and results follow, provid- 
ing details about the dataset, the performance evaluation 
measures, and experimental results. Last, we summarize 
our findings and outline future directions. 

Problem formulation 

As is commonly done in the context of classification, and
protein-location classification in particular [26,39,40], we
represent each protein, P, as a weighted feature vector,
$f^P = \langle f_1^P, \ldots, f_d^P \rangle$, where d is the number of features.
We view each feature as a random variable $F_i$ representing
a characteristic of a protein, such as the presence or
absence of a short amino acid motif [5,39], the relative
abundance of a certain amino acid as part of amino-acid
composition [17], or the annotation by a Gene Ontology
(GO) term [41]. Each vector-entry, $f_i^P$, corresponds
to the value taken by feature $F_i$ with respect to protein
P. In the experiments described here, we use the exact
same representation used by Briesemeister et al. [26] as
explained in the Experiments and results section, under
Data preparation.

We next introduce notation relevant to the representation
of a protein's localization. Let $S = \{s_1, \ldots, s_q\}$ be
the set of q possible subcellular components in the cell.
For each protein P, we represent its location(s) as a vector
of 0/1 values indicating the protein's absence/presence,
respectively, in each subcellular component. The location-indicator
vector for protein P is thus a vector of the form
$l^P = \langle l_1^P, \ldots, l_q^P \rangle$, where $l_i^P = 1$ if P localizes to $s_i$ and
$l_i^P = 0$ otherwise. As with the feature values, each location
value, $l_i^P$, is viewed as the value taken by a random
variable, where for each location, $s_i$, the corresponding
random variable is denoted by $L_i$. Given a dataset
consisting of m proteins along with their location vectors,
we denote the dataset as $D = \{(P_j, l^{P_j}) \mid 1 \le j \le m\}$.
We thus view the task of protein subcellular multi-location
prediction as that of developing a classifier
(typically learned from a dataset D of proteins whose
locations are known) that given a protein P outputs a q-dimensional
location-indicator vector that represents P's
localization.

As described in the previous section, most recent 
approaches that extend location-prediction beyond a sin- 
gle location (e.g. KnowPredsite [36] and iLoc-Euk [25]), 
do not consider inter-dependencies among locations. 
YLoc+ [26] indirectly considers these inter-dependencies
by creating a class for each location-combination. Our 
underlying hypothesis, which is supported by the exper- 
iments and the results presented here, is that directly 
capturing location inter-dependencies can form the basis 
for a generalizable approach for location-prediction. We 
discuss these inter-dependencies next. 

Consider a subset of subcellular locations $\{s_{i_1}, \ldots, s_{i_k}\} \subseteq S$.
Recall that we use the random variables $L_i$ to denote
whether a protein is localized or not to location $s_i$. Formally,
the locations in a set, $s_{i_1}, \ldots, s_{i_k}$, are considered
independent if for any protein P, the joint probability
of P to be in any of these locations can be written as
the product of the individual location probabilities, that
is:

$$\Pr\left(L_{i_1} = l_{i_1}^P, \ldots, L_{i_k} = l_{i_k}^P\right) = \prod_{j=1}^{k} \Pr\left(L_{i_j} = l_{i_j}^P\right).$$

If the locations are not independent, that is, if for a
protein P,

$$\Pr\left(L_{i_1} = l_{i_1}^P, \ldots, L_{i_k} = l_{i_k}^P\right) \neq \prod_{j=1}^{k} \Pr\left(L_{i_j} = l_{i_j}^P\right),$$

then we say that these locations are inter-dependent.

The training of a classifier for protein multi-location 
prediction involves learning such inter-dependencies so 
that the classifier can leverage them in the prediction 
process. We use Bayesian networks to model inter- 
dependencies. 
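
The definition of independence above can be checked empirically on a set of location-indicator vectors. The following is a minimal sketch and not part of the system described in this paper; the function name `independence_gap` and the toy data are illustrative assumptions. It compares the empirical joint probability of two indicators with the product of their marginals, for the pairwise case only.

```python
from itertools import product


def independence_gap(vectors, i, j):
    """Largest absolute difference between the empirical joint
    Pr(L_i = a, L_j = b) and the product Pr(L_i = a) * Pr(L_j = b),
    taken over all 0/1 value pairs (a, b). A gap of 0 means the two
    indicators are empirically independent in this sample."""
    m = len(vectors)
    gap = 0.0
    for a, b in product((0, 1), repeat=2):
        joint = sum(1 for v in vectors if v[i] == a and v[j] == b) / m
        p_i = sum(1 for v in vectors if v[i] == a) / m
        p_j = sum(1 for v in vectors if v[j] == b) / m
        gap = max(gap, abs(joint - p_i * p_j))
    return gap


# Hypothetical location-indicator vectors for q = 3 locations.
toy_vectors = [
    (1, 1, 0),
    (1, 1, 1),
    (0, 0, 0),
    (0, 0, 1),
]

gap_dependent = independence_gap(toy_vectors, 0, 1)    # locations 1 and 2 co-occur
gap_independent = independence_gap(toy_vectors, 0, 2)  # locations 1 and 3 do not interact
```

In this toy sample the first two indicators always take the same value, so their joint distribution differs from the product of the marginals, whereas the first and third indicators factorize exactly.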

In order to develop a protein subcellular multi-location
predictor, we propose to develop a collection of classifiers,
$C_1, \ldots, C_q$, where the classifier $C_i$ is viewed as an
"expert" responsible for predicting the 0/1 value, $l_i^P$, indicating
P's non-localization or localization to $s_i$. In order
to make use of location inter-dependencies, each $C_i$ uses
estimates of the location indicators of P, $\hat{l}_j^P$ (for all other
locations $s_j$, where $j \neq i$), along with the feature-values of P,
in order to calculate a prediction. We use support vector
machines (SVMs) (e.g. [20,21]) to compute these estimates.
The output of classifier $C_i$ for a protein P is given
by

$$C_i(P) = \begin{cases} 1 & \text{if } \Pr\left(l_i^P = 1 \mid P, \hat{l}_1^P, \ldots, \hat{l}_{i-1}^P, \hat{l}_{i+1}^P, \ldots, \hat{l}_q^P\right) > 0.5; \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$

Further details about the estimation procedure itself are
provided in the Methods section, under Multiple location
prediction.

Bayesian networks have been used before in many 
biological applications (e.g. [42-44]). In this paper, we 
use them to model inter-dependencies among subcellular 
locations, as well as among protein-features and loca- 
tions. We briefly introduce Bayesian networks here, along 
with the relevant notations (see [45] for more details). 
A Bayesian network consists of a directed acyclic graph 
G, whose nodes are random variables, which in our case 
represent features, denoted $F_1, \ldots, F_d$, and location indicators,
denoted $L_1, \ldots, L_q$. We assume here that all the
feature values are discrete. To ensure that, we use the
recursive minimal entropy partitioning technique presented
by Fayyad and Irani [46] and used by Dougherty
et al. [47] to discretize the features; this technique was also
used in the development of YLoc+ [26].

Directed edges in the graph indicate inter-dependencies 
among the random variables. Thus, as demonstrated in 
Figure 1, edges are allowed to appear between feature- 
and location-nodes, as well as between pairs of location- 
nodes in the graph. Edges between location-nodes directly 
capture the inter-dependencies among locations. We 
note that there are no edges between feature-nodes 
in our model, which reflects an assumption that fea- 
tures are either independent of each other or condition- 
ally independent given the locations. This simplifying 
assumption helps speed up the process of learning the 
network structure from the data, while the other allowed 
inter-dependencies still enable much of the structure of 
the problem to be captured (as demonstrated in the 
results). Further details about the learning procedure itself
are provided in the Methods section, under Learning
Bayesian network classifiers.

Figure 1 An example of a collection of Bayesian network
classifiers we learn. The collection consists of several classifiers
$C_1, \ldots, C_q$, one for each of the q subcellular locations. Directed edges
represent dependencies between the connected nodes. There are
edges among location variables ($L_1, \ldots, L_q$), as well as between
feature variables ($F_1, \ldots, F_d$) and location variables ($L_1, \ldots, L_q$), but not
among the feature variables. The latter indicates independencies
among features, as well as conditional independencies among
features given the locations.

Simha and Shatkay Algorithms for Molecular Biology 2014, 9:8
http://www.almob.org/content/9/1/8

To complete the Bayesian network framework, each
node $v \in \{F_1, \ldots, F_d, L_1, \ldots, L_q\}$ in the graph is associated
with a conditional probability table, $\theta_v$, containing
the conditional probabilities of the values the node
takes given its parents' values, $\Pr(v \mid Pa(v))$. We denote
by $\Theta$ the set of all conditional probability tables, and
the Bayesian network is the pair $(G, \Theta)$. A consequence
of using the Bayesian network structure is that it represents
certain conditional independencies among non-neighboring
nodes [45], such that the joint distribution of
the set of network variables can be simply calculated as:

$$\Pr(F_1, \ldots, F_d, L_1, \ldots, L_q) = \prod_{i=1}^{d} \Pr\left(F_i \mid Pa(F_i)\right) \prod_{j=1}^{q} \Pr\left(L_j \mid Pa(L_j)\right). \tag{2}$$
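
To make the factorization in Equation 2 concrete, the following minimal sketch evaluates the joint probability of a full assignment as a product of conditional probability table entries. The dictionary-based network encoding, the function name `joint_probability`, and the three-node toy network with its numbers are assumptions made for illustration only; they are not taken from the system described here.

```python
def joint_probability(network, assignment):
    """Probability of a full assignment under a Bayesian network,
    computed as the product over nodes of Pr(node | parents).

    network maps each node name to (parents, cpt), where cpt maps a
    tuple of parent values to a dict {node value: probability}."""
    prob = 1.0
    for node, (parents, cpt) in network.items():
        parent_values = tuple(assignment[p] for p in parents)
        prob *= cpt[parent_values][assignment[node]]
    return prob


# Hypothetical binary network: L1 -> L2 -> F1.
toy_network = {
    "L1": ((), {(): {0: 0.7, 1: 0.3}}),
    "L2": (("L1",), {(0,): {0: 0.9, 1: 0.1},
                     (1,): {0: 0.4, 1: 0.6}}),
    "F1": (("L2",), {(0,): {0: 0.8, 1: 0.2},
                     (1,): {0: 0.25, 1: 0.75}}),
}

# Pr(L1 = 1) * Pr(L2 = 1 | L1 = 1) * Pr(F1 = 1 | L2 = 1)
p = joint_probability(toy_network, {"L1": 1, "L2": 1, "F1": 1})
```

Because each table is normalized per parent configuration, the joint probabilities of all full assignments sum to one.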

Figure 1 shows an example of a collection of Bayesian
network classifiers. The collection consists of Bayesian
network classifiers $C_1, \ldots, C_q$, one for each of the q subcellular
locations $s_1, \ldots, s_q$, where each classifier $C_i$ consists
of the graph $G_i$ and its set of parameters $\Theta_i$ ($\Theta_i$
not shown in the figure). For each classifier $C_i$, the location
indicator variable $L_i$ is the variable we need to predict
and is therefore viewed as unobserved, and is shown as
an unshaded node in the figure. The feature variables
$F_1, \ldots, F_d$ are given for each protein and as such are
viewed as known or observed, shown as shaded nodes in
the figure. Finally, the values of the location indicator variables
for all locations except for $L_i$, $\{L_1, \ldots, L_q\} \setminus \{L_i\}$,
are needed for calculating the predicted value of $L_i$ in
the classifier $C_i$. As such, they are viewed by the classifier
as though they are observed. Notably, the values of
these variables are not known and therefore need to be
estimated.

Thus, the structure and parameters of the network for
each classifier $C_i$ (learnt as described in the next section)
are used to predict the value of each unobserved variable,
$L_i$. The task of each classifier $C_i$ is to predict the value
of the variable $L_i$ given the values of all other variables,
$F_1, \ldots, F_d$ and $\{L_1, \ldots, L_q\} \setminus \{L_i\}$. Since, as noted above,
the values of the location indicator variables $L_j$ ($j \neq i$)
are unknown at the point when $L_i$ needs to be calculated,
we estimate their values, using simple SVM classifiers as
described in the Methods section. We note that other
methods, such as expectation maximization, can be used
to estimate all the hidden parameters, which we shall do
in the future.

Methods 

As our goal is to assign (possibly multiple) locations
to proteins, we use a collection of Bayesian network
classifiers, where each classifier $C_i$ predicts the value
(0 or 1) of a single location variable $L_i$, while using
estimates of all the other location variables $L_j$ ($j \neq i$),
which are assumed to be known, as far as the classifier
$C_i$ is concerned. The estimates of the location values $L_j$
are calculated using SVM classifiers as described later
in this section. The individual predictions from all the
classifiers are then combined to produce a multi-location
prediction. For each location $s_i$, a Bayesian network classifier
$C_i$ must be learned from the training data before
it can be used. As described in the previous section,
each classifier $C_i$ consists of a graph structure $G_i$ and
a set of conditional probability parameters, $\Theta_i$, that is:
$C_i = (G_i, \Theta_i)$. Thus, our first task is to learn the individual
classifiers, i.e. their respective Bayesian network
structures and parameters. The individual networks can
then be used to predict whether a protein localizes to each
location.

Given a protein P, each classifier $C_i$ needs to accurately
predict the location indicator value $l_i^P$, given the feature-values
of P and estimates of all the other location indicator
values $\hat{l}_j^P$ (where $j \neq i$). That is, each classifier $C_i$ in
the collection assumes that the estimates of the location-indicator
values, $\hat{l}_j^P$, for all other locations $s_j$ (where $j \neq i$)
are already known, and is responsible for predicting only
the indicator value $l_i^P$ for location $s_i$, given all the other
indicator values. For a Bayesian network classifier this
means calculating the conditional probability

$$\Pr\left(l_i^P = 1 \mid P, \hat{l}_1^P, \ldots, \hat{l}_{i-1}^P, \hat{l}_{i+1}^P, \ldots, \hat{l}_q^P\right), \tag{3}$$

under classifier $C_i$, where $\hat{l}_1^P, \ldots, \hat{l}_{i-1}^P, \hat{l}_{i+1}^P, \ldots, \hat{l}_q^P$ are
estimated using simple SVM classifiers. The classifiers
$C_1, \ldots, C_q$ are each learned by directly optimizing an
objective function that is based on such conditional probabilities,
calculated with respect to the training data.

The procedures used for learning the Bayesian network
classifiers and for combining the individual network predictions
are described throughout the rest of this section.

Learning Bayesian network classifiers 

Given a dataset D consisting of a set of m proteins
$\{P_1, \ldots, P_m\}$ and their respective location vectors
$\{l^{P_1}, \ldots, l^{P_m}\}$, each classifier $C_i$ is trained so as to produce
the "best" prediction possible for the value of the location
indicator $l_i^P$ (for location $s_i$), for any given protein P
and a set of estimates of location indicators for all other
locations (as shown in Equation 3 above). Based on this
aim and on the available training data, we use the Conditional
Log Likelihood (CLL) as the objective function to
be optimized when learning each classifier $C_i$. Classifiers
whose structures were learnt by optimizing this objective
function were found to perform better than classifiers
that used other structures [38]. This objective function is
defined as:

$$CLL(C_i \mid D) = \sum_{j=1}^{m} \log \Pr\left(l_i^{P_j} \mid f^{P_j}, \hat{l}_1^{P_j}, \ldots, \hat{l}_{i-1}^{P_j}, \hat{l}_{i+1}^{P_j}, \ldots, \hat{l}_q^{P_j}\right).$$

Each $P_j$ is a protein in the training set, and each probability
term in the sum is the conditional probability of
protein $P_j$ to have the indicator value $l_i^{P_j}$ (for location $s_i$),
given its feature vector $f^{P_j}$ and the current estimates for
all the other location indicators $\hat{l}_k^{P_j}$ (where $k \neq i$) under
the Bayesian network structure $G_i$ for the classifier $C_i$ (see
Equation 2).
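
The conditional log likelihood described above can be sketched in a few lines. This is an illustrative sketch only: the function name `conditional_log_likelihood`, the `cond_prob` callback, and the constant toy model are hypothetical stand-ins, and the probabilities are assumed to lie strictly between 0 and 1 (for example, after smoothing).

```python
import math


def conditional_log_likelihood(cond_prob, proteins, i):
    """Sum over training proteins of log Pr(l_i | features, other estimates).

    cond_prob(features, estimates, i) is assumed to return
    Pr(l_i = 1 | features, estimates of the other indicators).
    proteins is a list of (features, estimates, true_labels) triples."""
    total = 0.0
    for features, estimates, true_labels in proteins:
        p_one = cond_prob(features, estimates, i)
        # Probability assigned to the indicator value actually observed.
        p_true = p_one if true_labels[i] == 1 else 1.0 - p_one
        total += math.log(p_true)
    return total


def constant_model(features, estimates, i):
    """Hypothetical stand-in classifier that always returns 0.8."""
    return 0.8


# Two toy proteins; location index 0 is the one being scored.
toy_proteins = [
    ({"F1": 1}, {1: 0}, {0: 1, 1: 0}),
    ({"F1": 0}, {1: 1}, {0: 0, 1: 1}),
]

cll = conditional_log_likelihood(constant_model, toy_proteins, 0)
```

Here the first protein contributes log 0.8 and the second log 0.2, so a structure that assigns higher probability to the observed indicator values obtains a larger (less negative) score.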

To learn a Bayesian network classifier that optimizes this 
objective function, we use a greedy hill climbing search. 
While Grossman and Domingos [38] proposed a heuris- 
tic method that modifies the basic search depicted by 
Heckerman et al. [48], we do not employ it in this pre- 
liminary study, but rather use the basic search, as the 
latter does not prove to be prohibitively time consum- 
ing. Our structure learner starts with an initial network 
with no directed edges. In each iteration of the hill climb- 
ing algorithm, a directed edge is either added, deleted, or 
its direction reversed. An example of each of the possible 
steps is shown in Figure 2. Notably, we do not allow the 
introduction of directed edges that connect two feature 
variables to one another. This constraint accounts for the 
assumption incorporated into the network structure, as 
discussed in the Problem formulation section, of indepen- 
dence or conditional independence among the features 
given the locations; it slightly simplifies the network struc- 
ture and reduces the search space and the overall learning 
time. 
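
The search described above can be sketched as follows. This is a simplified illustration rather than the implementation used in this study: the function names, the node naming convention, and the toy scoring function (which stands in for the conditional log likelihood) are all assumptions made for the example.

```python
from itertools import permutations


def has_cycle(nodes, edges):
    """Depth-first search cycle check on a directed graph."""
    children = {n: [v for (u, v) in edges if u == n] for n in nodes}
    state = {n: 0 for n in nodes}  # 0 = unvisited, 1 = in progress, 2 = done

    def visit(n):
        state[n] = 1
        for c in children[n]:
            if state[c] == 1 or (state[c] == 0 and visit(c)):
                return True
        state[n] = 2
        return False

    return any(state[n] == 0 and visit(n) for n in nodes)


def neighbours(nodes, edges, is_feature):
    """All edge sets reachable by adding, deleting, or reversing one edge."""
    for u, v in permutations(nodes, 2):
        if is_feature(u) and is_feature(v):
            continue  # edges between two feature nodes are never allowed
        if (u, v) in edges:
            yield edges - {(u, v)}               # delete
            yield (edges - {(u, v)}) | {(v, u)}  # reverse
        elif (v, u) not in edges:
            yield edges | {(u, v)}               # add


def hill_climb(nodes, score, is_feature):
    """Greedy search: start with no edges, repeatedly take the best
    strictly improving acyclic move, stop when none exists."""
    current = frozenset()
    current_score = score(current)
    while True:
        best, best_score = None, current_score
        for cand in neighbours(nodes, current, is_feature):
            cand = frozenset(cand)
            if has_cycle(nodes, cand):
                continue
            s = score(cand)
            if s > best_score:
                best, best_score = cand, s
        if best is None:
            return current, current_score
        current, current_score = best, best_score


# Toy run: the score rewards agreement with a hypothetical target structure.
toy_nodes = ["F1", "F2", "L1", "L2"]
toy_target = {("L2", "L1"), ("L1", "F1")}
learned, learned_score = hill_climb(
    toy_nodes,
    score=lambda edges: -len(set(edges) ^ toy_target),
    is_feature=lambda n: n.startswith("F"),
)
```

Since every accepted move strictly increases the score, the loop terminates at a local optimum; in the toy run that optimum coincides with the target structure.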



To find estimates for the location indicator values $\hat{l}_i^{P_j}$,
we compute a one-time estimate for each indicator
from the feature-values of the protein, $f^{P_j}$, by using an
SVM classifier (e.g. [20,21]). We employ q SVM classifiers,
$SVM_1, \ldots, SVM_q$, where each SVM classifier, $SVM_i$,
is trained to distinguish a single location indicator $l_i$ from
the rest. We use the SVM implementation provided by the
Scikit-learn library [49] with a Radial Basis Function kernel.
The rest of the network parameters are estimated as
follows:

Parameter learning: For each Bayesian network classifier
$C_i$, we use the maximum likelihood estimates calculated
from frequency counts in the training dataset, D,
to estimate the network parameters. For each node v in
the graph $G_i$ (where v may either be a feature variable or
a location variable), we denote its n parents as $Pa(v) =
\{Pa_1(v), \ldots, Pa_n(v)\}$. For each value x of v and values
$y_1, \ldots, y_n$ of its respective parents, the conditional probability
parameter $\Pr(v = x \mid Pa_1(v) = y_1, \ldots, Pa_n(v) = y_n)$
is computed as follows: Let $n_{joint}$ be the number of
proteins in the dataset D for whom the value of
variable v is x and the values of $Pa_1(v), \ldots, Pa_n(v)$
are $y_1, \ldots, y_n$, respectively; let $n_{marginal}$ be the number
of proteins in the dataset D whose values of the
variables denoted by $Pa_1(v), \ldots, Pa_n(v)$ are $y_1, \ldots, y_n$
(regardless of the value of variable v). The maximum
likelihood estimate for the conditional probability is
thus:

$$\Pr\left(v = x \mid Pa_1(v) = y_1, \ldots, Pa_n(v) = y_n\right) = \frac{n_{joint}}{n_{marginal}}.$$

To avoid overfitting of the parameters, we add pseudo-counts
to events that have zero counts (a variation on
Laplace smoothing [50]).
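
The counting procedure above can be illustrated with a short sketch. The exact smoothing scheme is not specified in the text, so the sketch assumes that a zero count is replaced by a pseudo-count of 1 before normalizing; the function name `estimate_cpt` and the toy rows are likewise hypothetical.

```python
from collections import Counter


def estimate_cpt(rows, node, parents, values=(0, 1), pseudo=1.0):
    """Estimate Pr(node = x | parents = y) from frequency counts.

    rows is a list of dicts mapping variable name -> observed value.
    Zero counts are replaced by `pseudo` before normalizing; this
    particular smoothing rule is an assumption of the sketch."""
    joint = Counter()
    configs = set()
    for row in rows:
        y = tuple(row[p] for p in parents)
        configs.add(y)
        joint[(y, row[node])] += 1
    cpt = {}
    for y in configs:
        # Counter returns 0 for unseen events, which triggers the pseudo-count.
        counts = {x: joint[(y, x)] or pseudo for x in values}
        total = sum(counts.values())
        cpt[y] = {x: counts[x] / total for x in values}
    return cpt


# Hypothetical training rows for a node L2 with a single parent L1.
toy_rows = (
    [{"L1": 1, "L2": 1}] * 3
    + [{"L1": 1, "L2": 0}]
    + [{"L1": 0, "L2": 0}] * 2
)

toy_cpt = estimate_cpt(toy_rows, node="L2", parents=("L1",))
```

When no count is zero, the estimate reduces to the ratio of the joint count to the marginal count, as in the formula above; for the parent configuration with an unseen event, the pseudo-count keeps the estimated probability away from zero.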

To summarize, at the end of the learning process we
have q Bayesian network classifiers, $C_1, \ldots, C_q$, like the
ones depicted in Figure 1 (one for each of the q locations),
and q SVMs, $SVM_1, \ldots, SVM_q$, used for obtaining initial
estimates for each location variable for any given protein.
We next describe how these classifiers are used to predict
the multi-location of a protein P.

Multiple location prediction 

Given a protein P, whose locations we would like to pre- 
dict, we first use the SVMs to obtain preliminary estimates 
for each of its location indicator values /f , . . . , We then 
use each of the learned classifiers Q, and the preliminary 
values obtained from the SVMs to predict the value of 
the location indicator /f . The classifier outputs a value of 
either a 0 or a 1 by thresholding, as shown in Equation 5. 
The entire process is depicted in Figure 3. The conditional 
probability of /f given the feature -values of the protein 




(iii) Deleting an edge (L2,Li). (iv) Reversing an edge (^2,-^1) to (Li,L2). 

Figure 2 Adding, deleting, and reversing an edge In a Bayesian 
network during structure learning. The network on the left (i), is 
the starting point. Networl<s (ii), (iii), and (iv) show the addition, 
deletion, and reversal of an edge, respectively, as performed by the 
greedy hill climbing algorithm for structure learning. 
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(Protein feature vector) 



(Location- 
Indicator 

Estimate 
Vector) 




Input to Bayesian network 
classifiers 



\ J Output from Bayesian 
^ ^ network classifiers 



I " (^1 '^2 "'^^ 



Predicted Location-Indicator Vector 



Figure 3 Multiple location prediction for protein P. First, SVMs SVM] 



, SVMn are used to obtain tlie location indicator estimates T^, . 



Bayesian networl< classifiers Ci , . . . , Q are then used to predict the actual location indicators 
location-indicator estimates as well as with inter-dependencies among the locations. 



, /g. The Bayesian network classifiers use the 



,/J.The 



P and the estimates of the location indicator values /J* 
(where j ^ i) is first calculated as: 

Pr(/f = l|/>,/f,...,ri,/^i,...,/^) = 

Pr(/f^l,/>,/f,...,tp/^P--->/^) 

(4) 

The joint probabilities in the numerator and the denominator
of Equation 4 above are factorized into conditional
probabilities using the Bayesian network structure $G_i$ (see
Equation 2). The 0/1 prediction for each $l_i^P$ obtained from
each $C_i$ becomes the value of the $i$'th position in the
location-indicator vector $(l_1^P, \ldots, l_q^P)$ for protein P. This is
the complete multi-location prediction for protein P.
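
The prediction step for a single location can be sketched as follows. This is only an illustrative sketch, not the system's implementation: `joint_prob` is a hypothetical callable standing in for the joint probability factorized by the Bayesian network classifier for location i, `toy_joint` is a made-up distribution used for demonstration, and the 0.5 cut-off is an assumed threshold. The denominator of Equation 4 is obtained by summing the joint probability over the two possible values of the location indicator.

```python
from typing import Callable, Sequence


def predict_indicator(
    joint_prob: Callable[[int, Sequence[int], Sequence[int]], float],
    features: Sequence[int],
    other_estimates: Sequence[int],
    threshold: float = 0.5,  # assumed cut-off, chosen for illustration only
) -> int:
    """Return the 0/1 prediction for one location indicator (cf. Equation 4)."""
    # Numerator of Equation 4: joint probability with the indicator set to 1.
    p1 = joint_prob(1, features, other_estimates)
    # Denominator: marginalize the joint over both indicator values.
    p0 = joint_prob(0, features, other_estimates)
    evidence = p0 + p1
    if evidence == 0.0:
        return 0  # evidence has zero probability under the model; default to 0
    posterior = p1 / evidence
    return 1 if posterior >= threshold else 0


def toy_joint(l_i: int, features: Sequence[int], other_estimates: Sequence[int]) -> float:
    """Hypothetical joint distribution: location i tends to co-occur with
    the first of the other locations. Features are ignored in this toy model."""
    prior = 0.3 if l_i == 1 else 0.7
    agrees = (other_estimates[0] == 1) == (l_i == 1)
    return prior * (0.9 if agrees else 0.1)


if __name__ == "__main__":
    print(predict_indicator(toy_joint, [0], [1]))
    print(predict_indicator(toy_joint, [0], [0]))
```

Running one such prediction per location and concatenating the results yields the predicted location-indicator vector for the protein.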

In the next section, we describe our experiments using 
the Bayesian network framework for predicting protein 
multi-location and the results obtained. 

Experiments and results 

We implemented our algorithms for learning and using
a collection of Bayesian network classifiers as described
above using Python and the machine learning library
Scikit-learn [49]. We have applied it to a dataset containing
single- and multi-localized proteins, previously used
for training YLoc+ [26]. Below we describe the dataset,
the experiments, the evaluation methods we use, and
the multiple location prediction results obtained on the
proteins from this dataset.

Data preparation 

In our experiments we use a dataset containing 5447
single-localized proteins (originally published as part of
the Hoglund dataset [39]) and 3056 multi-localized proteins
(originally published as part of the DBMLoc set
[11] that is no longer publicly available). The combined
dataset was constructed and previously used by
Briesemeister et al. [26] in their extensive comparison
of multi-localization prediction systems. Notably, the
protein sequences from the Hoglund dataset share no
more than 30% sequence identity with each other, while
sequences from the DBMLoc dataset share less than 80%
sequence similarity with each other. We report results
obtained over the multi-localized proteins for comparing
our system to other published systems, since the results
for these systems are only available for this subset [26].
For all other experiments described here, we report results
obtained over the combined set of single- and multi-localized
proteins. The single-localized proteins are from
the following locations (abbreviations and number of proteins
per location are given in parentheses): cytoplasm
(cyt, 1411 proteins), endoplasmic reticulum (ER, 198),
extracellular space (ex, 843), Golgi apparatus (gol, 150),
lysosome (lys, 103), mitochondrion (mi, 510), nucleus
(nuc, 837), membrane (mem, 1238), and peroxisome (per,
157). The multi-localized proteins are from the following
pairs of locations: cyt_nuc (1882 proteins), ex_mem (334),
cyt_mem (252), cyt_mi (240), nuc_mi (120), ER_ex (115),
and ex_nuc (113). Note that all the multi-location subsets
used have over 100 representative proteins.
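
To make the composition of the dataset concrete, the following sketch encodes the counts listed above and derives, for each location label, a 0/1 location-indicator vector as well as the total number of proteins associated with each location. The ordering of `LOCATIONS` and the helper names are assumptions made for illustration; the counts and the underscore-separated pair labels are the ones given in the text.

```python
from collections import Counter

# Order of locations in the indicator vector (an assumed, illustrative order).
LOCATIONS = ["cyt", "ER", "ex", "gol", "lys", "mi", "nuc", "mem", "per"]

# Counts as listed in the text.
SINGLE = {"cyt": 1411, "ER": 198, "ex": 843, "gol": 150, "lys": 103,
          "mi": 510, "nuc": 837, "mem": 1238, "per": 157}
MULTI = {"cyt_nuc": 1882, "ex_mem": 334, "cyt_mem": 252, "cyt_mi": 240,
         "nuc_mi": 120, "ER_ex": 115, "ex_nuc": 113}


def indicator_vector(label):
    """Map a label such as 'cyt' or 'cyt_nuc' to a 0/1 vector over LOCATIONS."""
    parts = set(label.split("_"))
    return [1 if loc in parts else 0 for loc in LOCATIONS]


def proteins_per_location():
    """Count proteins per location; a multi-localized protein counts once
    for each of its two locations."""
    totals = Counter()
    for table in (SINGLE, MULTI):
        for label, count in table.items():
            for loc in label.split("_"):
                totals[loc] += count
    return totals


if __name__ == "__main__":
    print(sum(SINGLE.values()), sum(MULTI.values()))
    print(indicator_vector("cyt_nuc"))
    print(proteins_per_location().most_common(5))
```

Counting this way gives, for example, 3785 proteins associated with the cytoplasm (1411 single-localized plus 1882, 252, and 240 from the pairs that include cyt).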

Protein representation 

We use the exact same representation of a 30-dimensional
feature vector as used by Briesemeister et al. for YLoc+
[26,51], described below. However, as described later, we
also run experiments in which we do not use annotation-based
features (items iii and iv in the list below) in the
protein representation.

(i) Thirteen features derived directly from the protein 
sequence data, specifically, length of the amino acid 
chain, length of the longest very hydrophobic region, 
respective number of Methionine, Asparagine, and
Tryptophan occurring in the N-terminus, number
of small amino acids occurring in the N-terminus, 
and numerical values based on: (a) ER retention 
signal, (b) peroxisomal targeting signal, (c) clusters of 
consecutive Leucines occurring in the N-terminus, 
(d) secretory pathway sorting signal, (e) putative 
mitochondrial sorting signal; 

(ii) Nine features constructed using pseudo-amino acid
composition [52], which are based on certain physical
and chemical properties of amino acid subsequences;

(iii) Two annotation-based features constructed using 
two distinct groups of PROSITE patterns, one 
characteristic of plasma-membrane proteins and the 
other of nucleus proteins. For each protein, the value 
of the respective feature is 1 if the protein sequence 
contains at least one PROSITE pattern characteristic 
of the organelle, 0 otherwise; 

(iv) Six annotation-based features based on 
GO-annotations. Five of these correspond to five 
location-specific GO terms [GO:0005783 
(endoplasmic reticulum), GO:0005739 
(mitochondrion), GO:0005576 (extracellular region), 
GO:0042025 (host cell nucleus), and GO:0005778 
(peroxisomal membrane)], where the feature value is 
1 if at least one sequence homologous to the protein's 
is associated with the GO term according to 
Swiss-Prot (release 42.0), 0 otherwise. The sixth 
feature indicates the likely location of the protein 
given all the GO terms assigned to it (or to its 
homologues) in Swiss-Prot; 

(See Briesemeister et al. [26,51] for further details regard- 
ing the pre-processing, feature construction, and feature 
selection.) 

Feature discretization 

To ensure that all feature values are discrete, we use the
minimal entropy partitioning technique as initially presented
by Fayyad and Irani [46] and used by Dougherty
et al. [47]. We rephrase the partitioning technique by using
concepts from Information Theory, in particular, the
definition of conditional entropy [53]. Each continuous-valued
feature is converted into a discrete-valued feature
by recursively dividing the range of values that the feature
obtains into intervals; all feature values lying within
an interval are mapped to a single discrete feature value.

Formally, for a training set of m proteins associated
with q locations $s_1, \ldots, s_q$, we denote the range of values
assigned to feature $f_i$ for proteins in the set by $[l_{f_i}, h_{f_i}]$,
where $l_{f_i}$ is the lowest value in the range and $h_{f_i}$ the highest.
A discretization boundary $T_i$ partitions the feature
value range $[l_{f_i}, h_{f_i}]$ into two intervals, $[l_{f_i}, T_i]$ and $(T_i, h_{f_i}]$.
For each protein $P_j$ in the set (where $1 \le j \le m$), its feature
value for feature $f_i$, denoted $f_i^j$, is mapped to a value
$d_1$ if and only if $f_i^j \in [l_{f_i}, T_i]$ and to another value $d_2$ if and only if $f_i^j \in (T_i, h_{f_i}]$,
where $d_1$ and $d_2$ are two distinct values, chosen from the
set $\{0, 1, 2, \ldots\}$ (e.g. $d_1 = 0$ and $d_2 = 1$).

Each location $s_k$ ($1 \le k \le q$), with which a protein
$P_j$ (whose feature value for $f_i$ is $f_i^j$) may be associated, is
viewed as a value taken by a random variable S. The conditional
probability distribution of S given a feature value
$f_i^j$ and the discretization boundary $T_i$ is defined as:

$$\Pr(S \mid f_i^j, T_i) = \begin{cases} \Pr(S \mid f_i^j \le T_i) & \text{if } f_i^j \le T_i; \\ \Pr(S \mid f_i^j > T_i) & \text{if } f_i^j > T_i. \end{cases} \quad (5)$$


The respective conditional entropy is denoted $H(S \mid f_i, T_i)$
[53] and defined as:

$$H(S \mid f_i, T_i) = -\Pr(f_i^j \le T_i) \sum_{k=1}^{q} \left[ \Pr(S = s_k \mid f_i^j \le T_i) \times \log_2 \Pr(S = s_k \mid f_i^j \le T_i) \right] - \Pr(f_i^j > T_i) \sum_{k=1}^{q} \left[ \Pr(S = s_k \mid f_i^j > T_i) \times \log_2 \Pr(S = s_k \mid f_i^j > T_i) \right],$$

where $\Pr(f_i^j \le T_i)$ is estimated as the proportion of
proteins in the training set whose feature value for $f_i$ is
less than or equal to $T_i$, $\Pr(f_i^j > T_i)$ is estimated as
the proportion of proteins whose feature value for $f_i$ is
greater than $T_i$, $\Pr(s_k \mid f_i^j \le T_i)$ is estimated by the proportion
of proteins associated with location $s_k$ among those
whose feature value for $f_i$ is less than or equal to $T_i$, and
$\Pr(s_k \mid f_i^j > T_i)$ is estimated by the proportion associated
with $s_k$ among those proteins whose feature value for $f_i$ is
greater than $T_i$. The discretization boundary $T_i$ is chosen
such that the conditional entropy $H(S \mid f_i, T_i)$ is minimal.
The partitioning into intervals is applied recursively, and
terminates when a stopping condition based on the Minimum
Description Length Principle (see Fayyad and Irani
[46] for details) is satisfied. This recursive partitioning is
independently applied to each of the features.
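
The choice of a single discretization boundary can be sketched as below. This is a simplified illustration under stated assumptions, not the procedure of Fayyad and Irani in full: each training example is assumed to carry one location label, candidate boundaries are taken to be the observed feature values (other candidate sets, such as midpoints, are possible), and the recursive application with the Minimum Description Length stopping condition is omitted. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of the empirical label distribution."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def conditional_entropy(values, labels, boundary):
    """Conditional entropy of the labels given whether value <= boundary."""
    left = [s for v, s in zip(values, labels) if v <= boundary]
    right = [s for v, s in zip(values, labels) if v > boundary]
    n = len(values)
    return (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)


def best_boundary(values, labels):
    """Return the candidate boundary with minimal conditional entropy,
    or None if the feature takes a single value."""
    # The largest value is excluded: it would leave the upper interval empty.
    candidates = sorted(set(values))[:-1]
    if not candidates:
        return None
    return min(candidates, key=lambda t: conditional_entropy(values, labels, t))


if __name__ == "__main__":
    values = [0.1, 0.2, 0.3, 0.8, 0.9]
    labels = ["cyt", "cyt", "cyt", "nuc", "nuc"]
    t = best_boundary(values, labels)
    print(t, conditional_entropy(values, labels, t))
```

In the toy data the boundary that separates the two locations perfectly yields a conditional entropy of zero, so it is the one selected.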

Exclusion of annotation-based features 

It has been shown by several groups [8,41,54] that protein 
subcellular location prediction performance is improved 
by incorporating features based on GO-annotations asso- 
ciated with each protein (which may also include location 
annotation) into the protein representation. However, we 
note that an important goal of protein location predic- 
tion is to assign locations to proteins that are not yet 
annotated; that is, the location-prediction tool may serve 
as an aid in the protein annotation process. Therefore, 
it is useful to be able to accurately predict location of 
proteins even without using annotation-based features 
such as PROSITE patterns and GO terms. To test the
performance of our system with and without such features,
we have constructed several versions of the dataset
in which we include/exclude PROSITE-based and GO-based
features: (i) PROSITE-GO, which includes both
PROSITE- and GO-based features in the protein representation;
(ii) No-PROSITE-GO, which does not include
any PROSITE- or GO-based features in the protein representation;
(iii) No-PROSITE, which does not include
PROSITE-based features, but includes GO-based features;
and (iv) No-GO, which does not include any GO-based
features, but includes PROSITE-based features, in the
protein representation. These datasets are used later in
this section (see Classification results) to demonstrate
that location inter-dependencies can be used to improve
prediction performance, even in the absence of PROSITE-based
and GO-based features.
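
The four dataset versions amount to selecting subsets of the 30 feature columns. The sketch below illustrates this; the column layout (sequence-derived features first, followed by pseudo-amino acid composition, PROSITE-based, and GO-based features) and all names are assumptions made for illustration, while the group sizes (13, 9, 2, and 6) follow the protein representation described earlier.

```python
# Assumed column layout of the 30-dimensional feature vector.
FEATURE_GROUPS = {
    "sequence": range(0, 13),   # (i) thirteen sequence-derived features
    "pseaa": range(13, 22),     # (ii) nine pseudo-amino acid composition features
    "prosite": range(22, 24),   # (iii) two PROSITE-based features
    "go": range(24, 30),        # (iv) six GO-based features
}

# Feature groups excluded in each dataset version.
DATASET_VERSIONS = {
    "PROSITE-GO": set(),
    "No-PROSITE-GO": {"prosite", "go"},
    "No-PROSITE": {"prosite"},
    "No-GO": {"go"},
}


def kept_columns(version):
    """Column indices retained in the given dataset version."""
    excluded = DATASET_VERSIONS[version]
    return [col
            for group, cols in FEATURE_GROUPS.items()
            if group not in excluded
            for col in cols]


def project(vector, version):
    """Restrict a 30-dimensional feature vector to the retained columns."""
    return [vector[col] for col in kept_columns(version)]


if __name__ == "__main__":
    for name in DATASET_VERSIONS:
        print(name, len(kept_columns(name)))
```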

Experimental setting and performance measures 

To compare the performance of our system to that 
of other systems (YLoc+ [26], Euk-mPLoc [24], WoLF 
PSORT [23], and KnowPredsite [36]), whose performance 
on a large set of multi-localized proteins was described 
in a previously published comprehensive study [26], we 
use the exact same dataset, employing the commonly 
used stratified 5-fold cross-validation. As the information 
about the exact 5-way splits used in previous studies is 
not available, we ran five complete runs of 5-fold-cross- 
validation (i.e. 25 runs in total), where each complete run 
of 5-fold cross-validation uses a different 5-way split. The 
use of multiple runs with different splits helps validate 
the stability and the statistical significance of the results. 
To ensure that the results obtained by using our 5-way
splits for cross-validation can be fairly compared with
those reported before [26], we replicated the YLoc+ runs
using our 5-way splits, and obtained results that closely
match those originally reported by Briesemeister et al. [26].
(The replicated $F_1$-label score is 0.69 with standard deviation
±0.01, compared to YLoc+ reported $F_1$-label score
of 0.68, and the replicated accuracy is 0.65 with standard
deviation ±0.01, compared to YLoc+ reported accuracy
of 0.64.) The total training time for our system is about
11 hours (wall-clock), when running on a standard Dell
PowerEdge machine with 32 AMD Opteron 6276 processors.
Notably, no optimization or heuristics for improving
run time were employed, as this is a one-time training. For
the experiments described here, we ran 25 training experiments,
through 5 times 5-fold cross-validation, where the
total run time was about 75 hours (wall-clock).
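
The evaluation protocol of five complete runs of stratified 5-fold cross-validation, each with a different split, can be sketched as follows. This is an illustrative sketch using only the Python standard library; stratifying by the location-combination label (e.g. cyt or cyt_nuc) and using the run index as the random seed are assumptions, as the exact splitting procedure is not specified here.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Partition example indices into k folds, spreading each label evenly."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        members = by_label[label]
        rng.shuffle(members)
        # Deal the shuffled members of this label to the folds in turn.
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds


def repeated_cv(labels, k=5, repeats=5):
    """Yield (run, fold, train_indices, test_indices) for repeats x k runs."""
    for run in range(repeats):
        folds = stratified_folds(labels, k, seed=run)  # a different split per run
        for held_out in range(k):
            test = folds[held_out]
            train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
            yield run, held_out, train, test


if __name__ == "__main__":
    toy_labels = ["cyt"] * 10 + ["cyt_nuc"] * 5
    print(sum(1 for _ in repeated_cv(toy_labels)))
```

Scores are then computed on each of the 25 held-out folds, and their mean and standard deviation are reported.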

We use in our evaluation the adapted measures of accuracy
and $F_1$ score proposed by Tsoumakas et al. [55] for
evaluating multi-label classification. Some of these measures
have also been previously used for multi-location
evaluation [26,37]. To formally define these measures, let
D be a dataset containing m proteins. For a given protein
P, let $M^P = \{s_i \mid l_i^P = 1, \text{ where } 1 \le i \le q\}$ be the set
of locations to which protein P localizes according to the
dataset, and let $\hat{M}^P = \{s_i \mid \hat{l}_i^P = 1, \text{ where } 1 \le i \le q\}$
be the set of locations that a classifier predicts for protein
P, where $\hat{l}_i^P$ is the 0/1 prediction obtained (as described
in the Methods section). The multi-label accuracy and the
multi-label $F_1$ score are defined as:

$$Acc = \frac{1}{|D|} \sum_{P \in D} \frac{|M^P \cap \hat{M}^P|}{|M^P \cup \hat{M}^P|} \quad \text{and} \quad F_1 = \frac{1}{|D|} \sum_{P \in D} \frac{2\,|M^P \cap \hat{M}^P|}{|M^P| + |\hat{M}^P|},$$

respectively.
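
As a small worked illustration of the multi-label accuracy and the multi-label F1 score, the sketch below represents the true and predicted location sets of each protein as Python sets. The function names are hypothetical, and assigning a score of 1 to a protein whose true and predicted sets are both empty is an assumed convention (the case does not arise when every protein has at least one location).

```python
def multilabel_accuracy(true_sets, pred_sets):
    """Mean over proteins of |true & predicted| / |true | predicted|."""
    total = 0.0
    for m, m_hat in zip(true_sets, pred_sets):
        union = m | m_hat
        total += (len(m & m_hat) / len(union)) if union else 1.0
    return total / len(true_sets)


def multilabel_f1(true_sets, pred_sets):
    """Mean over proteins of 2 |true & predicted| / (|true| + |predicted|)."""
    total = 0.0
    for m, m_hat in zip(true_sets, pred_sets):
        denom = len(m) + len(m_hat)
        total += (2 * len(m & m_hat) / denom) if denom else 1.0
    return total / len(true_sets)


if __name__ == "__main__":
    true_sets = [{"cyt", "nuc"}, {"mem"}]
    pred_sets = [{"cyt"}, {"mem"}]
    print(multilabel_accuracy(true_sets, pred_sets))
    print(multilabel_f1(true_sets, pred_sets))
```

For the two toy proteins, the first contributes 1/2 to the accuracy and 2/3 to the F1 score (one of its two locations is recovered), and the second contributes 1 to both, giving an accuracy of 0.75 and an F1 score of 5/6.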



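To evaluate how well our system classifies proteins as
localized or not localized to each individual location $s_i$, we
use adapted measures of multi-label precision and recall
denoted $Pre_{s_i}$ and $Rec_{s_i}$ and defined as follows [26]:

$$Pre_{s_i} = \frac{1}{|\{P \in D \mid s_i \in \hat{M}^P\}|} \sum_{P \in D \mid s_i \in \hat{M}^P} \frac{|M^P \cap \hat{M}^P|}{|\hat{M}^P|}$$

$$Rec_{s_i} = \frac{1}{|\{P \in D \mid s_i \in M^P\}|} \sum_{P \in D \mid s_i \in M^P} \frac{|M^P \cap \hat{M}^P|}{|M^P|}$$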

We use here the terms Multilabel-Precision and
Multilabel-Recall to refer to $Pre_{s_i}$ and $Rec_{s_i}$, respectively.
Note that $Pre_{s_i}$ captures the ratio of the number of correctly
predicted multiple locations to the total number of
multiple locations predicted, and $Rec_{s_i}$ captures the ratio
of the number of correctly predicted multiple locations
to the number of original multiple locations, for all the
proteins that co-localize to location $s_i$. Therefore, high
values of these measures for proteins that co-localize to
the location $s_i$ indicate that the sets of predicted locations
that include location $s_i$ are predicted correctly.

Additionally, the $F_1$-label score used by Briesemeister
et al. [26] to evaluate the performance of multi-location
predictors is computed as:

$$F_1\text{-label} = \frac{1}{|S|} \sum_{s_i \in S} \frac{2 \times Pre_{s_i} \times Rec_{s_i}}{Pre_{s_i} + Rec_{s_i}}$$
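
The per-location Multilabel-Precision and Multilabel-Recall, and the F1-label score that averages their harmonic mean over the locations, can be illustrated with the sketch below. The function names are hypothetical; returning 0 when no protein is predicted to localize (or actually localizes) to a location, or when precision and recall are both 0, is an assumed convention.

```python
def multilabel_precision(true_sets, pred_sets, loc):
    """Average |true & predicted| / |predicted| over proteins predicted at loc."""
    pairs = [(m, mh) for m, mh in zip(true_sets, pred_sets) if loc in mh]
    if not pairs:
        return 0.0
    return sum(len(m & mh) / len(mh) for m, mh in pairs) / len(pairs)


def multilabel_recall(true_sets, pred_sets, loc):
    """Average |true & predicted| / |true| over proteins truly localized to loc."""
    pairs = [(m, mh) for m, mh in zip(true_sets, pred_sets) if loc in m]
    if not pairs:
        return 0.0
    return sum(len(m & mh) / len(m) for m, mh in pairs) / len(pairs)


def f1_label(true_sets, pred_sets, locations):
    """Mean over locations of the harmonic mean of the two measures above."""
    total = 0.0
    for loc in locations:
        pre = multilabel_precision(true_sets, pred_sets, loc)
        rec = multilabel_recall(true_sets, pred_sets, loc)
        total += (2 * pre * rec / (pre + rec)) if (pre + rec) > 0 else 0.0
    return total / len(locations)


if __name__ == "__main__":
    true_sets = [{"cyt", "nuc"}, {"cyt"}, {"nuc"}]
    pred_sets = [{"cyt"}, {"cyt", "nuc"}, {"nuc"}]
    print(multilabel_precision(true_sets, pred_sets, "cyt"))
    print(multilabel_recall(true_sets, pred_sets, "cyt"))
    print(f1_label(true_sets, pred_sets, ["cyt", "nuc"]))
```

In this toy example, two proteins are predicted to localize to the cytoplasm; their predicted sets are fully correct (1/1) and half correct (1/2), so the Multilabel-Precision for the cytoplasm is 0.75. The same value is obtained for the other three per-location measures, and hence for the F1-label score.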



Finally, to evaluate the correctness of predictions made
for each location $s_i$, we use the standard precision and
recall measures, denoted by $\text{Pre-Std}_{s_i}$ and $\text{Rec-Std}_{s_i}$ (e.g.
[7]) and defined as:

$$\text{Pre-Std}_{s_i} = \frac{TP}{TP + FP} \quad \text{and} \quad \text{Rec-Std}_{s_i} = \frac{TP}{TP + FN},$$

where TP (true positives) denotes the number of proteins
that localize to $s_i$ and are predicted to localize to $s_i$, FP
(false positives) denotes the number of proteins that do
not localize to $s_i$ but are predicted to localize to $s_i$, and
FN (false negatives) denotes the number of proteins that
localize to $s_i$ but are not predicted to localize to $s_i$.
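
The standard per-location precision and recall can be computed directly from the true and predicted location sets, as in the following sketch. The function name is hypothetical, and returning 0 when a denominator is 0 is an assumed convention.

```python
def standard_precision_recall(true_sets, pred_sets, loc):
    """Return (precision, recall) for one location from TP, FP, and FN counts."""
    pairs = list(zip(true_sets, pred_sets))
    tp = sum(1 for m, mh in pairs if loc in m and loc in mh)
    fp = sum(1 for m, mh in pairs if loc not in m and loc in mh)
    fn = sum(1 for m, mh in pairs if loc in m and loc not in mh)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    true_sets = [{"cyt", "nuc"}, {"cyt"}, {"nuc"}]
    pred_sets = [{"cyt"}, {"cyt", "nuc"}, {"nuc"}]
    print(standard_precision_recall(true_sets, pred_sets, "cyt"))
    print(standard_precision_recall(true_sets, pred_sets, "nuc"))
```

Unlike the multi-label measures, these values depend only on whether the location itself is predicted correctly for each protein, not on the other locations predicted for the same protein.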

Classification results 

Table 1 shows the $F_1$-label score and the accuracy of our
system obtained when running over the PROSITE-GO
version of the dataset (which includes both PROSITE- and
GO-based features in the protein representation), in comparison
to those obtained by other predictors (as reported
by Briesemeister et al. [26], Table Three there), using the
same set of multi-localized proteins and evaluation measures.
While the table shows that our system has a slightly
lower performance than YLoc+, the differences in the values
are not statistically significant (as indicated by the
standard deviations of the scores obtained by our system),
and the overall performance level is comparable. Thus our
approach performs as effectively as current top-systems,
while having the advantage of directly capturing inter-dependencies
among locations in a generalizable manner
(that is, without introducing a new location-class for each
new location-combination).

Tables 2 and 3 both show the $F_1$ score, the $F_1$-label
score, and the accuracy obtained by the SVM classifiers
(used for computing estimates of location indicators)
without using location inter-dependencies, compared
with the corresponding values obtained by our system
using location inter-dependencies, on the combined
dataset of both single- and multi-localized proteins.
Table 2 displays the scores obtained when running
over the PROSITE-GO version of the dataset, whereas
Table 3 displays the scores obtained when running over
the No-PROSITE-GO, No-PROSITE, and No-GO versions
of the dataset (which do not include the respective
annotation-based features in the protein representation).
All the scores in Tables 2 and 3 obtained using inter-dependencies
are higher (in some cases statistically significantly)
than those obtained by using SVMs alone without
utilizing inter-dependencies. The differences are highly
statistically significant ($p \ll 0.001$), as measured by the
2-sample t-test [56], when running over the PROSITE-GO,
No-PROSITE, and No-GO versions of the dataset.

Table 3 shows that location inter-dependencies improve 
multi-location prediction even when annotation-based 
features, which utilize PROSITE or GO, are not included 
in the feature set representing the protein. Furthermore, 
we see from Tables 2 and 3 that the performance of 
our system does not deteriorate substantially when run- 
ning over dataset versions that do not include vari- 
ous annotation-based features. Thus, our system shows 
robustness to the presence/absence of annotation-based 
features. 

Table 4 shows the prediction results obtained by our
system when running over the PROSITE-GO version of
the dataset for the five locations that have the largest
number of associated proteins: cytoplasm (cyt), extracellular
space (ex), nucleus (nuc), membrane (mem), and
mitochondrion (mi), on the combined dataset of both
single- and multi-localized proteins. For each location $s_i$,
we show the standard precision ($\text{Pre-Std}_{s_i}$) and recall


Table 1 Multi-location prediction results on the PROSITE-GO version of the dataset, averaged over 25 runs of 5-fold cross-validation, for multi-localized proteins only, using our system, YLoc+ [26], Euk-mPLoc [24], WoLF PSORT [23], and KnowPredsite [36]

              Our system       YLoc+ [26]   Euk-mPLoc [24]   WoLF PSORT [23]   KnowPredsite [36]
$F_1$-label   0.66 (± 0.02)    0.68         0.44             0.53              0.66
Acc           0.63 (± 0.01)    0.64         0.41             0.43              0.63

The $F_1$-label score and Acc measures shown for all the systems except for ours
are taken directly from Table Three in the paper by Briesemeister et al. [26].
Standard deviations are provided for our system (not available for others).



Table 2 Multi-location prediction results on the PROSITE-GO version of the dataset, averaged over 25 runs of 5-fold cross-validation, for the combined set of single- and multi-localized proteins, using our system

                                      $F_1$            $F_1$-label      Acc
SVMs (without using dependencies)     0.77 (± 0.01)    0.67 (± 0.02)    0.72 (± 0.01)
Our system (using dependencies)       0.81 (± 0.01)    0.76 (± 0.02)    0.76 (± 0.01)

The table shows the $F_1$ score, the $F_1$-label score, and the overall accuracy (Acc)
obtained from SVMs without using location inter-dependencies and from our
system, which uses location inter-dependencies. Standard deviations are shown
in parentheses.

Table 3 Multi-location prediction results on the No-PROSITE-GO, No-PROSITE, and No-GO versions of the dataset,
averaged over 25 runs of 5-fold cross-validation, for the combined set of single- and multi-localized proteins, using our
system

                                      Dataset          $F_1$            $F_1$-label      Acc
SVMs (without using dependencies)     No-PROSITE-GO    0.75 (± 0.04)    0.66 (± 0.02)    0.70 (± 0.04)
Our system (using dependencies)       No-PROSITE-GO    0.78 (± 0.05)    0.72 (± 0.07)    0.73 (± 0.05)
SVMs (without using dependencies)     No-PROSITE       0.77 (± 0.01)    0.66 (± 0.02)    0.72 (± 0.01)
Our system (using dependencies)       No-PROSITE       0.80 (± 0.01)    0.75 (± 0.02)    0.75 (± 0.01)
SVMs (without using dependencies)     No-GO            0.76 (± 0.03)    0.67 (± 0.03)    0.71 (± 0.03)
Our system (using dependencies)       No-GO            0.79 (± 0.04)    0.72 (± 0.08)    0.74 (± 0.04)

The table shows the $F_1$ score, the $F_1$-label score, and the overall accuracy (Acc) obtained from SVMs without using location inter-dependencies and from our system,
which uses location inter-dependencies. Standard deviations are shown in parentheses.



($\text{Rec-Std}_{s_i}$) as well as the Multilabel-Precision ($Pre_{s_i}$) and
Multilabel-Recall ($Rec_{s_i}$). The table shows values for each
of the measures obtained by SVMs without using location
inter-dependencies and by our system using location
inter-dependencies. When using inter-dependencies, for
a few locations, such as cytoplasm and membrane, the
Multilabel-Precision ($Pre_{s_i}$) decreases. Nevertheless, most
of the differences are not highly statistically significant
($p > 0.01$), as measured by the 2-sample t-test [56].
The Multilabel-Recall ($Rec_{s_i}$) increases for all locations
with the use of inter-dependencies, where the differences
in most cases are highly statistically significant ($p \ll 0.001$).
We examine the statistically significant differences
in the Multilabel-Recall for cytoplasm (3785 proteins),
membrane (1824), and peroxisome (157). The Multilabel-Recall
for cytoplasm ($Rec_{cyt}$) increases from 0.78 when
classifying by SVMs without using inter-dependencies,
to 0.80 when incorporating inter-dependencies. The
Multilabel-Recall for membrane ($Rec_{mem}$) increases from
0.76 to 0.78 under similar conditions. Even for a location
like peroxisome that has fewer associated proteins, the
Multilabel-Recall increases from 0.37 using simple SVMs
to 0.65 using our classifier. Our analysis demonstrates the
advantage of using location inter-dependencies for predicting
protein locations, not just for locations that have a
large number of associated proteins but also for locations
that are associated with relatively few proteins.

Discussion and conclusions 

We presented a new way to use a collection of Bayesian
network classifiers, taking advantage of location inter-dependencies,
to provide a generalizable method for
predicting possible multiple locations of proteins. The
results demonstrate that the performance of our preliminary
system is comparable to the current best performing
multi-location predictor YLoc+ [26]. The latter
indirectly addresses dependencies by creating a class for
each multi-location combination. Our results also show
that utilizing inter-dependencies significantly improves



Table 4 Multi-location prediction results on the PROSITE-GO version of the dataset, per location, averaged over 25 runs
of 5-fold cross-validation, for the combined set of single- and multi-localized proteins

                                        cyt (3785)       ex (1405)        nuc (2952)       mem (1824)       mi (870)
$\text{Pre-Std}_{s_i}$ (SVMs)           0.84 (± 0.01)    0.87 (± 0.02)    0.79 (± 0.02)    0.93 (± 0.01)    0.90 (± 0.03)
$\text{Pre-Std}_{s_i}$ (Our system)     0.84 (± 0.01)    0.91 (± 0.02)    0.79 (± 0.03)    0.90 (± 0.01)    0.87 (± 0.03)
$\text{Rec-Std}_{s_i}$ (SVMs)           0.85 (± 0.01)    0.64 (± 0.02)    0.72 (± 0.02)    0.79 (± 0.02)    0.62 (± 0.03)
$\text{Rec-Std}_{s_i}$ (Our system)     0.86 (± 0.01)    0.65 (± 0.02)    0.74 (± 0.03)    0.80 (± 0.02)    0.66 (± 0.03)
$Pre_{s_i}$ (SVMs)                      0.82 (± 0.01)    0.89 (± 0.02)    0.83 (± 0.01)    0.92 (± 0.01)    0.87 (± 0.03)
$Pre_{s_i}$ (Our system)                0.81 (± 0.02)    0.91 (± 0.02)    0.83 (± 0.01)    0.90 (± 0.01)    0.89 (± 0.02)
$Rec_{s_i}$ (SVMs)                      0.78 (± 0.01)    0.72 (± 0.02)    0.77 (± 0.01)    0.76 (± 0.01)    0.68 (± 0.02)
$Rec_{s_i}$ (Our system)                0.80 (± 0.01)    0.74 (± 0.02)    0.78 (± 0.02)    0.78 (± 0.01)    0.73 (± 0.02)

Results are shown for the five locations $s_i$ that have the largest number of associated proteins (the number of proteins per location is given in parentheses): cytoplasm
(cyt), extracellular space (ex), nucleus (nuc), membrane (mem), and mitochondrion (mi). The table shows the per-location measures: standard precision ($\text{Pre-Std}_{s_i}$), recall
($\text{Rec-Std}_{s_i}$), Multilabel-Precision ($Pre_{s_i}$), and Multilabel-Recall ($Rec_{s_i}$), obtained from SVMs without using location inter-dependencies and from our system using location
inter-dependencies. For each location and measure, the highest of the values obtained from the two methods is shown in boldface. Standard deviations are shown in
parentheses.
the performance of the location prediction system, with
respect to SVM classifiers that do not use any inter- 
dependencies. Moreover, this improved performance due 
to the use of location inter-dependencies is maintained 
even when the protein representation does not include 
PROSITE patterns-based features or GO-based features, 
thus exhibiting robustness to the presence/absence of 
annotation-based features. 

In most biological applications that have used Bayesian
networks so far (e.g. [42-44]), the variable-space typically
corresponds to genes or SNPs, which is a very large space
and necessitates the use of strong simplifying assumptions
and many heuristics. In contrast, we note that predicting
multiple locations for proteins involves a significantly
smaller number of variables (as the number of subcellular
components and the number of features for representing
proteins are relatively small), making this task ideally
suited to the use of Bayesian networks.

The study presented here is a first investigation into 
the benefit of directly modeling and using location inter- 
dependencies. To obtain initial estimates for location 
values, we used a simple SVM classifier, and location 
inter-dependencies were only learned based on these val- 
ues. While the results already show much improvement 
with respect to the baseline SVM classifiers, we believe 
that a better approach would be to simultaneously learn 
a Bayesian network while estimating the location values 
using iterative optimization methods such as expectation 
maximization. 

We note that although the dataset we use is the most 
extensive available collection of multi-localized proteins, 
several subcellular locations are not represented in the 
dataset at all due to the low number of proteins associated 
with them. Similarly, there is not enough data pertaining 
to proteins that are localized to more than two locations. 
We are in the process of building a set of multi-localized 
proteins that will be used in future work to test the per- 
formance of our system on new, and more complex, com- 
binations. We also plan to explore alternative approaches 
for learning models of location inter-dependencies from 
the available data. 

Endnote 

We note that here we set out to show that capturing
inter-dependencies among locations helps improve
prediction, and the relatively simple estimation
procedure that we use serves sufficiently well.